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ABSTRACT 

The challenge presented by high-throughput se- 
quencing necessitates the development of novel 
tools for accurate alignment of reads to reference se- 
quences. Current approaches focus on using heuris- 
tics to map reads quickly to large genomes, rather 
than generating highly accurate alignments in cod- 
ing regions. Such approaches are, thus, unsuited for 
applications such as amplicon-based analysis and 
the realignment phase of exome sequencing and 
RNA-seq, where accurate and biologically relevant 
alignment of coding regions is critical. To facilitate 
such analyses, we have developed a novel tool, RAM- 
ICS, that is tailored to mapping large numbers of se- 
quence reads to short lengths (<10 000 bp) of coding 
DNA. RAMICS utilizes profile hidden Markov models 
to discover the open reading frame of each sequence 
and aligns to the reference sequence in a biologi- 
cally relevant manner, distinguishing between gen- 
uine codon-sized indels and frameshift mutations. 
This approach facilitates the generation of highly ac- 
curate alignments, accounting for the error biases of 
the sequencing machine used to generate reads, par- 
ticularly at homopolymer regions. Performance im- 
provements are gained through the use of graphics 
processing units, which increase the speed of map- 
ping through parallelization. RAMICS substantially 
outperforms all other mapping approaches tested in 
terms of alignment quality while maintaining highly 
competitive speed performance. 

INTRODUCTION 

The issue of accurate pairwise sequence alignment is com- 
mon to many fields in bioinformatics, whether as a core tool 
in fields such as reference -guided genome assembly (1-6) or 
as the seed for the generation of a progressive multiple se- 



quence alignment (7-9). Ideally, analysis of a pairwise align- 
ment should proceed in the knowledge that no inherent bias 
exists as a result of the alignment approach used. This is 
particularly challenging in the era of high-throughput se- 
quencing, where every platform produces systematic errors 
(10-15) that should be considered in constructing an ahgn- 
ment. 

The alignment of coding DNA, in particular, presents 
a unique challenge as it is critical that the final alignment 
takes into account the correct reading frame. Preserva- 
tion of the reading frame ensures correct calling of gene 
structure (16-18) and SNPs, whether in, for example, the 
realignment phase of an exome sequencing pipeline (19), 
single-nucleotide polymorphism (SNP) calling from exist- 
ing RNA-seq data (20,21) or ampHcon-based analyses such 
as human immunodeficiency virus (HIV) drug resistance 
genotyping (22). When aligning coding DNA generated us- 
ing high-throughput sequencing platforms, it is critical that 
codons present in the open reading frame remain intact 
and that codon-sized insertions and deletions are recog- 
nized and called correctly, as distinct from both genuine 
frameshifts and single indels created through sequencing er- 
ror. 

The landscape of pairwise alignment and reference map- 
ping tools for high-throughput sequencing data is broad. 
Tools such as BOWTIE 2 (2) and BWA-MEM (3), while 
well-suited to mapping the location of query sequence 
reads within a complete reference genome, lack the sub- 
tle nuances required to correctly distinguish spurious indels 
from genuine codon-sized mutations in coding DNA. Other 
tools, such as MOSAIK (6) and SSAHA2 (5), perform 
full mapping and realignment using a Smith-Waterman 
approach, with a focus on correcting next-generation se- 
quencing errors. However, basic Smith-Waterman align- 
ment, even when taking into account quality scores as in 
MOSAIK and BWA-MEM, is not appropriate for refer- 
ence mapping of coding DNA as it fails to maintain the in- 
tactness of codons. Finally, the Genome Analysis Toolkit 
(GATK) (23,24) performs the realignment step for tools 
such as BWA-MEM in, for example, the 1000 Genomes 
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project pipeline (25), but does not consider the biological 
context (coding or non-coding) in its alignment. 

One approach to ensure codons are maintained for down- 
stream analysis, taken by the RevTrans program (26), is to 
initially translate query sequences into their corresponding 
amino acid sequences, align them to a translated reference 
sequence at the amino acid level and then 'back-align' the 
nucleotide sequences based on the amino acid alignment 
(26). While effective in some cases, this approach is use- 
less in the presence of indels in the open reading frame, 
which results in mistranslation to amino acid space. Tools 
such as transAlign (27) repetitively translate, align, back- 
translate and correct multiple sequence alignments, which 
to some extent addresses frameshift errors. This approach 
is, however, time-consuming, as in order to align robustly it 
requires a full amino acid multiple sequence alignment to 
create a back-translated DNA profile to which low-scoring 
sequences (assumed to be frameshifted) are compared. An 
alternative is to use an approach that undertakes pairwise 
alignment in 'codon space' — weighting the likelihood of 
each nucleotide substitution within the context of the codon 
to which it belongs. This, thus, facilitates the identification 
of indels, whether genuine frameshift mutations or the re- 
sult of sequencing error. 

Initial models of coding DNA such as those used by Pair- 
Wise (16) and EST_Genome (28) were designed specifically 
for gene prediction, comparing sequences against all six 
translated frames simultaneously using a classical dynamic 
programming approach to discover exons. However, this ap- 
proach was found to be time-consuming and overly specific 
(29). The HMMER tool (30) took a different approach, us- 
ing profile hidden Markov models to align and compare 
protein sequences. GeneWise and GenomeWise (29) use a 
hidden Markov model approach to allow for the alignment 
of a nucleotide sequence, potentially including frameshifts, 
to an existing protein sequence, and thus represent the clos- 
est approximation to a codon-codon alignment tool. How- 
ever, because the alignment is to a protein sequence, it is 
impossible to penalize synonymous mutations. Protein mis- 
matches are all scored equally, as are single -nucleotide inser- 
tions resulting from sequencing error. This prevents the use 
of GeneWise and GenomeWise to correctly identify and ac- 
count for homopolymer errors in next-generation sequenc- 
ing platforms. 

Here, we present a novel reference mapper (RAMICS: 
rapid amplicon mapping in codon space) that uses a hidden 
Markov model approach to align large numbers of coding 
DNA sequences to a coding region of a reference sequence 
in an accurate and biologically relevant manner, account- 
ing for the inherent error biases of the sequencing platform 
used. RAMICS is developed to be executed on graphics pro- 
cessing units (GPUs) and, thus, provides significant speed 
improvements over other, similarly complex, mapping tools. 
RAMICS is, thus, ideal for any situation where a large num- 
ber of sequence reads from a known, coding location in a 
genome must be mapped to a coding reference sequence. 
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Figure 1. The RAMICS hidden MaAov model. Match states are shown 
as circles, delete states as squares and insert states as diamonds. The states 
'F and 'D' are whole-codon insertions and deletions, and are the only self- 
transitioning states. One copy of this HMM exists for every codon in the 
reference sequence, and they are chained together to build a profile HMM. 
Dashed lines indicate an unshown transition to the next set of states in the 
chain.. 

MATERIALS AND METHODS 

The open reading frame 

RAMICS is designed specifically to align coding DNA. The 
tool first discerns the optimal open reading frame of the ref- 
erence sequence by selecting a frame that would minimize 
the number of stop codons. RAMICS then performs a pair- 
wise alignment to each query sequence in 'codon space' us- 
ing a hidden Markov model (31) (Figure 1 and Supplemen- 
tary information). Single- and double-nucleotide indels are 
identified and accounted for, thereby ensuring that biologi- 
cally relevant complete codons are kept intact. This enables 
a correct and relevant comparison of a query sequence to 
the reference, as the concept of a codon-sized insertion that 
begins within one codon and ends within another (as would 
be output by most conventional alignment approaches) is 
structurally meaningless in a biological context (Figure 2A). 

RAMICS uses models of codon substitution (32-34) 
to give synonymous codon substitutions higher transi- 
tion probabilities as well as favoring more likely non- 
synonymous substitutions. RAMICS can also be set to use a 
simple non-coding substitution model to aHgn non-coding 
DNA to a non-coding reference sequence. However, RAM- 
ICS cannot concurrently align sequence reads to coding and 
non-coding regions of the same reference sequence (e.g. a 
reference sequence containing both introns and exons), and, 
as such, is unsuitable for use as the first-pass mapping tool 
in full-genome association studies. It can, however, be sub- 
sequently used for targeted realignment of coding regions 
as a second-pass mapping tool. 

The sequencing platform 

RAMICS uses reported average error rates for given se- 
quencing platforms (10,12,13,15) to set the transition prob- 
abilities to single- and double-nucleotide indel states in the 
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Figure 2. (A) 'Biologically relevant alignments'. Sequences X and Y differ from each other by a three-nucleotide insertion. Conventional nucleotide- 
based approaches generate a statistically optimal pairwise alignment matching as many nucleotides as possible. RAMICS, on the other hand, considers 
the biological importance of codons and generates a biologically relevant alignment preserving the codon structure of the coding DNA being aligned. 
(B) 'Correctly accounting for homopolymer errors'. Comparing alignments of an HIV pol gene sequence containing a homopolymer error (additional 
adenosine, green arrow) to a reference sequence shows that RAMICS correctly identifies and accounts for this error The codon of interest in the reference 
sequence is marked with a black box. Conventional nucleotide alignment approaches would call this codon as AAA (flagging the G as an insertion) which 
encodes for lysine (sensitive form), while RAMICS would call the codon as AGA (arginine, the resistant form) correctly identifying and discounting the 
extra adenosine introduced as a result of the sequencing error at the homopolymer region. 



profile hidden Markov model. In addition, the model is 
more likely to place a single-nucleotide indel on either side 
of a homopolymer region when the sequencing platform 
is set to Roche/454 (35) or Ion Torrent Personal Genome 
Machine (36), as homopolymer miscalls are a common se- 
quencing error on these platforms (13). 

To illustrate the necessity of considering the sequencing 
platform used to generate the sequence data, we take an 
example from HlV-1 antiretroviral drug resistance testing. 
The drug resistance profile of a sequence is determined by 
characterizing mutations at various amino acid positions 
throughout the polymerase gene mapped in relation to the 



HXB2 HIV reference sequence (51). Drug resistant muta- 
tions include the mutation of lysine (K) to asparagine (N) 
at codon 103 (K103N), which confers drug resistance to cer- 
tain non-nucleoside reverse transcriptase inhibitors (NNR- 
TIs) (22). 

The K103N codon is located in a homopolymer region 
and some high-throughput sequencing platforms such as 
Roche/454 (35) and Ion Torrent PGM (32), as well as to 
a lesser extent the Illumina platform (37), are prone to in- 
creased error rates in regions containing homopolymers, as 
they cannot successfully identify the correct number of nu- 
cleotides in these regions (12,13). Incorrect assignment of a 
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homopolymer indel during the mapping step would result 
in a viral sequence being falsely called as susceptible when 
it is actually resistant (54). 

A standard Needleman-Wunsch alignment to the HXB2 
reference sequence of a sequence containing both the 
K103N mutation and a spurious extra adenosine directly 
following homopolymer region would result in the virus in 
question being falsely classified as susceptible to NNRTIs 
such as efavirenz (Figure 2B). Mapping of the same data 
using RAMICS, however, weighs the possibility of a false 
tyrosine (T) insertion in the center of a homopolymer re- 
gion against a false adenosine (A) insertion at the end of the 
region, assigns the latter a higher probability and, thus, cor- 
rectly calls the presence of the K103M mutation in spite of 
the hompolymer error (Figure 2B). Thus, a major strength 
of RAMICS is in its application to situations where the cor- 
rectness of alignment in a coding region is crucial, such as 
alignment optimization in RNA-seq and exome sequencing 
(19-21,38,39). 

Training 

It can be a common case in reference mapping, particu- 
larly for amplicon-based analyses, that a set of query se- 
quence reads will be more closely related to each other 
than they will to a given reference sequence — this is true in 
reference-guided assembly (40), drug-resistance genotyping 

(41) and applications in metagenomic function prediction 

(42) . However, a comparison to the reference sequence is 
still required for these applications. RAMICS circumvents 
any potential bias introduced through a distantly related 
reference sequence by allowing the use of the reference se- 
quence as a guide for the initial mapping and then itera- 
tively training the underlying hidden Markov model to the 
resulting pairwise alignments. The result is a set of emission 
and transition probabilities that will correctly align outlier 
sequences (potentially with many sequencing errors) in the 
query set, based on the substitutions in sequences more sim- 
ilar to the reference. This approach generates a global refer- 
ence alignment that approaches a reference-seeded multiple 
sequence alignment, and which can be used as such. 

Output formats 

RAMICS produces results in various formats with the 
choice of output format very much driven by the end-users 
requirements: 

(i) A global nucleotide alignment of all query sequences 
to the reference sequence, in FASTA format. 

(ii) A derivation of the above global nucleotide alignment 
in which all single- and double-nucleotide insertions 
are widened to the width of a full codon using gaps 
(codons.fasta). This does not alter the sequences, but 
allows the open reading frame to be read clearly in 
alignment viewers such as SeaView (43). 

(iii) A global alignment with all single- and double- 
nucleotide insertions removed (clean. fasta). This 
format does edit the sequences by assuming that 
all single- and double-indels are the result of 
sequencing/polymerase chain reaction (PCR) error. 



While this is extremely useful in some amplicon-based 
analyses where frameshift mutation can be safely 
treated as PCR/sequencing errors, it should not be 
used in cases where a frameshift mutation may have 
biological relevance. 

(iv) A global alignment containing only one instance of 
each identical sequence with a count of the number of 
contributing sequence reads added to the sequence de- 
scriptor. 

(v) A set of SAM files describing mapped reads, with soft 
clipping used beyond the end of the reference sequence. 

(vi) A file containing a set of pairwise gapped FASTA for- 
mat alignments of each query sequence to the refer- 
ence. 

Evaluation of RAMICS 

RAMICS is designed to map large numbers of high- 
throughput sequencing reads from a coding DNA region 
to a similar reference sequence, both quickly and accu- 
rately. RAMICS also has potential as an alignment refiner 
for the coding regions of whole-genome mapping or as 
the seed pairwise aligner for a profile HMM multiple se- 
quence aligner in the style of the protein sequence aligner 
Clustal Omega (44). Both as a standalone amplicon map- 
per and as a component of these larger toolchains, it can 
align large numbers of reads quickly as well as aligning 
reads accurately in regions that have similar biological fea- 
tures but contain many synonymous or functionally similar 
substitutions. Although RAMICS has the ability to remove 
frameshifts in reads, under the assumption that they are se- 
quencing errors, we did not remove any indels before eval- 
uating the alignments. 

We first compared the quality of alignments with RAM- 
ICS in non-coding mode, coding mode without training 
and coding mode with training, respectively. We then com- 
pared RAMICS for quality and speed against Bowtie-2 (2), 
BWA-MEM (3), BWA-MEM post-processed by the Indel- 
Realigner from the GATK (23), MOSAIK (6), the basic 
Needleman-Wunsch aligners EMBOSS 'needle' and 'water' 
(45) ('needle' is used where a global alignment is more ap- 
propriate, 'water' for local alignment), NextGenMap (46), 
SSAHA-2 (5) and SHRiMP2 (1). These tools represent a 
mix of global, local and 'glocal' alignment techniques (47), 
as for the mapping of short reads to a relatively short am- 
plicon all methods are potentially valid. 

Quality performance 

The RABEMA benchmark. Several benchmarks for the 
mapping of simulated next-generation sequencing data ex- 
ist (48,49). The SEAL benchmark (49) is not easily exten- 
sible to new alignment tools, so we first performed a sim- 
ple benchmark of RAMICS' relative abihty to correctly de- 
align spurious indels using the Rabema benchmark (48). 
Rabema uses the MASON read simulator (50) to simu- 
late artificial short reads containing sequencing errors from 
a user-provided reference sequence and to output a 'gold 
standard' SAM files containing each of these reads mapped 
to the reference sequence. The alignments generated by a 
mapping tool are then compared to the 'gold standard' 
alignments. 
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The tool then bins reads by the number of errors found, 
ranging from perfect reads to highly erroneous outlier reads 
containing 4% or more errors. This allows us to test align- 
ment performance not only of reads exhibiting the average 
error rates reported by manufacturers but also of rare out- 
lier reads that exhibit a higher error rate than the reported 
average. The Rabema output gives the percentage of cor- 
rectly mapped reads as a function of percentage sequencing 
error in the read. We ran each benchmark 10 times with 10 
different random seeds to ensure statistical significance. 

The AFAB benchmark. While Rabema is a good bench- 
mark of a mapping tools ability to detect sequencing er- 
rors such as spurious indels, it only determines whether an 
alignment is correct or incorrect and, thus, does not al- 
low quantification of how incorrect the alignment is from 
the 'gold standard'. Further, because all of the reads gen- 
erated by Rabema originate from the same reference se- 
quence, the levels of diversity of the query sequences being 
mapped are lower than what would be generated in many 
genuine sequencing experiments. Thus, we developed a flex- 
ible alignment benchmark (AFAB) that could evaluate not 
only whether a correct alignment was found, but also how 
similar this alignment was to a gold standard alignment, in 
the case where very divergent sequences made perfect map- 
ping impossible. This was achieved using a combination of 
tools. 

We began by selecting a 'gold standard' alignment of 
protein coding DNA which was the highly manually cu- 
rated HIV-1 group M subtype reference alignment of the 
HIV envelope gene from the LANL HIV reference sequence 
database (51). This ahgnment was chosen for two reasons: 
(i) it encompasses the broad spectrum of diversity present 
within the HIV-1 group M subtypes (pairwise genetic di- 
versity measured with the generalized time-reversible sub- 
stitution model (52) ranged from 0.15 to 0.24 substitutions 
per site, median 0.218) and (ii) the envelope gene comprises 
both conserved and variable regions. This ensured that the 
benchmarking was performed on a complex example of 
'real-life' data to which mapping tools are applied. 

Using one representative sequence per subtype we ex- 
tracted all possible pairwise alignments resulting in 55 
unique alignments. For each alignment we used ART (48) 
to generate artificial Roche/454 and Illumina reads (1000 
reads per pairwise alignment) from the second sequence of 
the pair. Further, ART generated SAM files containing the 
correct alignment of each simulated read to the reference 
sequence (the first sequence), which were used as the 'gold 
standard' for evaluation of the accuracy of the various map- 
ping tools. The choice of ART has the added benefit of cre- 
ating greater test coverage, ensuring that RAMICS is not 
just incidentally biased to MASON's model of read simu- 
lation. For both the Roche/454 and Illumina data the qs- 
core tool (8) was used to evaluate each of the mapping tools, 
giving the Q Score and Cline shift similarity scores of each 
tool's generated alignment to the gold standard. 

We then ranked the Q Score and Cline shift for each tool 
that chose to map each read, without penalizing tools that 
chose not to map the read as Rabema does. When multiple 
tools achieved the same score (most often 1.0 in the case 
of trivially mappable reads), we assigned each tool the av- 



erage score they would receive, to avoid artificially inflating 
the scores of tools that chose to map only trivially mappable 
reads. This follows the ranking scheme used in the Friedman 
test (53) and removes the assumption of normality across 
easy and hard to map reads. We then report the average 
score obtained across all reads mapped, with a score closer 
to 1 indicating a consistently higher ranking of a mapping 
approach. We repeated each benchmark five times with five 
different random seeds to ensure statistical significance. 

Speed performance 

We benchmarked the wall clock time of RAMICS, and 
other tools, in mapping amplicon-style data. We mapped 
one million Roche GS Junior FASTQ reads (average length 
237 base pairs) from the HIV reverse transcriptase (RT) 
gene to a short, 312 base pair region of the HXB2 HIV- 
1 reference sequence (54). All benchmarks were performed 
on an Intel® Core™ i5-3210M CPU @ 2.50 GHz and an 
NVIDIA® GeForce® GT 525M graphics card. 

RESULTS AND DISCUSSION 

Evaluation of the RAMICS algorithm 

Initially we evaluated the various components of the RAM- 
ICS algorithm to explore whether the addition of com- 
plexity improves the quality of the resulting alignment. We 
mapped both the Roche/454 and Illumina versions of the 
AFAB benchmarking data using the simplest incarnation 
of RAMICS (without codon ahgnment or training), RAM- 
ICS with codon alignment and the full RAMICS algorithm 
(with both the codon-based alignment and training param- 
eters invoked). The addition of codon-based ahgnment in- 
creased the number of perfectly aligned sequence reads by 
7.9 and 10.5% for the Roche/454 and Illumina data, respec- 
tively (Figure 3). Further, including three rounds of train- 
ing of the HMM to the query data provided marginal im- 
provements in the number of perfectly mapped reads by 
1.5% (Roche/454) and 0.9% (Illumina). The number of dis- 
carded reads remained mostly unchanged regardless of the 
parameters used (Figure 3) showing that while alignment 
accuracy increases with algorithmic complexity, RAMICS 
is consistently capable of distinguishing between mappable 
and unmappable reads. The lesser impact of the training 
parameter can be explained by its intended use case; the 
proper alignment of extreme outlier reads with large num- 
bers of sequencing errors. Thus, in most instances we sug- 
gest that RAMICS should be used with both the codon- 
based alignment and training parameters invoked. For all 
further benchmarking analysis we used the full RAMICS 
algorithm including both the codon-based alignment and 
training parameters. 

Comparison of RAMICS to other tools 

The use case for which RAMICS is designed is rather unhke 
that for which most other mapping tools are tailored, be- 
cause of its focus on coding DNA. Despite this, many stud- 
ies have used various mapping tools to analyze sequence 
reads in ways similar to that which RAMICS was designed 
for (55-58) and, thus, we present comparisons of RAMICS 
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Figure 3. Alignment accuracy comparison of the components of the RAMICS algorithm. For each mapping approach, we show the percentage of simulated 
(A) Roche/454 and (B) lllumina reads that were misaligned (i.e. not discarded by the mapping tool but with an incorrect alignment when compared to the 
'gold standard') as well as the percentage of reads that resulted in a perfect alignment when compared with the 'gold standard' alignment. For both of the 
data sets the most complex model (coding and training) performs best. 



to a number of commonly used mapping tools on the basis 
of alignment quality and speed. 



Comparison on the basis of alignment quality 

The Rabema benchmark. Initially we compared RAM- 
ICS' mapping quality against other tools using the Rabema 
benchmark (39) for both Roche/454 and lllumina data. 
Reads were simulated from a single reference sequence us- 
ing the reported error rates of the sequencing platforms as 
encoded by the Mason read simulator (50). These resulting 
sequence reads are then binned by their percentage differ- 
ence from their seed sequence. For example, if the seed se- 
quence used to generate a read was 100 nucleotides long and 
the resulting read had three differences relative to its seed, 
then that read would be added into the 3% error bin. This 
allowed the evaluation of the mapping tools using realistic 
error rates for both typical and outlier sequences. As would 
be expected, the observed error rate for vast majority of se- 
quence reads for the lllumina data set fell around 1% with 
88% of sequences exhibiting an error rate of <1%. For the 
454 data, 49% of the sequence reads exhibited an error rate 
of < 1% while the error rate of the remaining reads was in ex- 
cess of 1%. The higher-than-expected number of sequences 
with an error rate of >1% most likely results from the fact 
that the single reference sequence used to seed the simula- 
tions contained a number of homopolymer regions, which 
are known to increase the overall observed error rate of the 
Roche/454 platform (12,13). 



For the Roche/454 benchmark, the performance of all 
of the tools tested was comparable for reads containing 
zero errors, ranging from 100% to 97.9% of reads correctly 
mapped for RAMICS and SSAHA, respectively (Figure 
4A). Such high performance is to be expected, as these se- 
quence reads are identical to the reference sequence. As the 
percentage of erroneous nucleotides in a read increased, the 
accuracy of each of the mapping tools, with the exception of 
RAMICS, reduced (Figure 4A). SHRIMP (1) was excluded 
from the Roche/454 benchmark as it discarded over 70% 
of the reads containing 4% erroneous nucleotides with the 
recommended settings. 

A similar result was observed for the lllumina bench- 
marking data with all methods correctly mapping 100% of 
reads at zero error rate and the accuracy of the various ap- 
proaches, with the exception of RAMICS, reducing as the 
percentage error rate increased (Figure 4B). For the lllu- 
mina benchmark data we observed that compared to the 
Roche/454 data, the rate of reduction of accuracy is much 
sharper as the error rate increases. We suggest this effect is 
due to the shorter lengths of lllumina reads, which leave 
fewer correct nucleotides on either side of an error with 
which to anchor an alignment. 

RAMICS outperforms all of the other approaches for 
both the Roche/454 and lllumina benchmarking data sets. 
This is because RAMICS renders sequencing error in reads 
when aligning to a reference sequence almost irrelevant, 
even in outlier reads with a large number of errors, by cor- 
rectly identifying likely sequencing errors. MOSAIK was 
the next best performing tool (albeit with a reduction of 8% 
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Figure 4. Alignment accuracy using the Rabema benchmarks. The percentage of reads mapped correctly as a function of the percentage error in simulated 
sequence reads for both the Roche/454 (A) and Illumina (B) sequencing platforms. Reads are binned together by the Rabema tool according to the 
percentage of erroneous nucleotides in the simulated read relative to its seed sequence. The numbers in parentheses following the percentage error represent 
the mean percentage of simulated reads (across the five Rabema runs) that were binned by Rabema into that error category. We have excluded BWA-MEM 
without GATK as it performed identically to BWA-MEM + GATK. SHRIMP was excluded from the Roche/454 benchmark as it performed extremely 
poorly on this test. 
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in accuracy at an error rate of 4% in the Illumina data), 
while SHRIMP and SSAHA2, both of which conserva- 
tively choose not to align reads, quickly deteriorate in their 
mapping ability with increasing error rate (Figure 4). The 
GATK IndelRealigner (23) made no statistically significant 
difference to the test results, so we have removed it from 
Figure 4 for clarity. 

The AFAB benchmark. The Rabema benchmark clearly 
shows that RAMICS outperforms all other mapping ap- 
proaches on the outlier sequences with a higher-than- 
average error rate. However, these data are not a true reflec- 
tion of the type of data for which RAMICS has been opti- 
mized, namely mapping a diverse set of sequence reads to 
a related reference sequence. Rabema data only differ from 
the reference sequence through the insertion of sequencing 
errors and, thus, we developed a benchmarking tool to en- 
able the assessment of mapping tools at aligning coding se- 
quence reads containing sequencing errors that are diverse 
from the reference sequence. In contrast to the Rabema 
benchmarking tool, this approach enabled the use of two ro- 
bust methods for evaluating relative alignment accuracy en- 
abling the identification of perfectly aligned reads and mis- 
aligned reads. Reads discarded during mapping were not pe- 
nalized in the final ranking thereby allowing tools to discard 
ambiguous reads without incurring unnecessary bias in the 
benchmarking process. 

Extraction of all possible pairwise alignments of subtypes 
A through G from the 'gold standard' HIV alignment re- 
sulted in 55 sequence pairs, each of which were evaluated 
using this approach. Firstly, we considered the percentage 
of reads that each mapping approach was capable of align- 
ing regardless of the quality of the alignment. Needle (45) 
does not possess the capability to discard any reads, result- 
ing in 100% of reads being mapped for both the Illumina 
and Roche/454 data (Figure 5). The percentage of reads re- 
tained by the other approaches ranges from 38% (SHRIMP) 
to 94% (MOSAIK) for the Roche/454 data and 20% (BWA- 
MEM) and 62% (RAMICS) for the Illumina data (Figure 
5). 

As expected, because these data contained sequencing 
errors and were diverse from the reference sequence, the 
percentage of perfectly mapped reads was lower for all ap- 
proaches than was observed for the Rabema benchmarking 
data. For the Illumina data, RAMICS significantly outper- 
forms all of the other mapping approaches by aligning 62% 
of the total number of reads perfectly, with the next best 
approach (needle) mapping 48% of all reads correctly (Fig- 
ure 5). For the Roche/454 data RAMICS performed best, 
correctly mapping 36% of reads perfectly, while the per- 
formance of the other methods ranges between 19% (MO- 
SAIK) and 34% (SSAHA2). The GATK IndelRealigner 
made no difference to the outcome for BWA-MEM, so we 
have removed standalone BWA-MEM from Figure 5 for 
clarity. 

It is also interesting to consider the read discarding strate- 
gies employed by the various approaches. Ideally, a map- 
ping tool would discard all reads that it could not align per- 
fectly, while still aligning as many reads as possible. Consid- 
ering that all reads in this benchmark data set are simulated 
from a gold standard alignment, Bowtie and BWA-MEM's 



strategy of discarding almost 80% of the reads leads to a 
high rate of perfect alignment, but an unacceptable loss of 
coverage depth in situations where many variants are being 
aligned or deep coverage is necessary. 

For all of the reads that were aligned by one or more map- 
ping tools we used two approaches for evaluating alignment 
quality, the Q Score (59) and Cline shift score (60), to rank 
each of the mapping tools for each of the 5500 pairs of se- 
quences reads that were tested. 

The perfect mapping tool would rank first or joint first 
compared to all other tools for every single comparison. 
When two or more tools received equal ranking ties were 
broken according to the Friedman test (53) by assigning, 
for example, a ranking of 1.5 to each tool if two tools were 
equally first, preventing a bias toward tools that aligned 
more reads. The average Q Score ranks for the methods 
ranged from 2.512 to 4.846 and 2.978 to 4.562 for the Il- 
lumina and Roche/454 data, respectively (Table 1). The av- 
erage CHne shift ranks range from 2.522 to 4.565 and 3.039 
to 4.817 for the Illumina and Roche/454 data, respectively. 
For both benchmarking data sets RAMICS was the high- 
est ranked mapping approach tested, with BWA-MEM per- 
forming worst over both data sets (Table 1). The addition of 
the GATK (23) to the BWA-MEM toolchain, the approach 
used in the 1000 Genomes Project (25), had a negligible ef- 
fect on the quality of alignment for this use case. 

Thus, for both benchmarking approaches used across 
both Illumina and Roche/454 data, RAMICS outperforms 
all other methods tested here in terms of alignment qual- 
ity. We must reiterate, however, that all approaches here are 
being tested against data that have been simulated to test 
the use case for which RAMICS has been specifically de- 
signed. While RAMICS outperforms all of the mapping ap- 
proaches tested here, it is context based (focused on coding 
DNA that is divergent from the reference sequence) and, 
thus, it should not be assumed that RAMICS outperforms 
all methods in all scenarios, e.g. whole-genome mapping. 

Speed performance 

While alignment accuracy for mapping tools is essential, the 
scale of the data sets being generated by high-throughput 
sequencing approaches means that mapping speed is also a 
critical factor for an optimal method. With this in mind we 
have designed RAMICS to harness the computing power of 
graphic processing units (GPUs) for optimal speed. We have 
also developed a CPU-based version for researchers who do 
not have access to GPU hardware. Here, we have compared 
the mapping speeds of both the CPU and GPU versions of 
RAMICS against the other mapping approaches in map- 
ping one milHon Roche/454 reads from the HIV RT gene to 
a reference sequence. We found that, on average, the speeds 
for all approaches scale linearly as the data set size increases 
(data not shown) and find that in most instances NextGen- 
Map is the fastest of the tools tested while MOSAIK is the 
slowest (Figure 6). Not surprisingly, we find that the GPU 
version of RAMICS outperforms the CPU version with as 
much as a 3-fold speedup. Nonetheless we find that RAM- 
ICS performs in line with most of the read mappers tested. 

The majority of the other mapping tools are built to ahgn 
reads quickly to a large genome using various heuristics 
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Figures. Alignment accuracy for coding data. For eacli mapping approacli evaluated we show the percentage of simulated (A) Roche/454 and (B) Illumina 
reads that were misahgned (i.e. not discarded by the mapping tool but with an incorrect alignment when compared to the 'gold standard') as well as the 
percentage of reads that resulted in a perfect alignment when compared with the 'gold standard' alignment. We have excluded BWA-MEM without GATK 
as it performed identically to BWA-MEM + GATK. 



Table 1. The average rank over aligned reads only for the given mapping tools 



Tool 


Illumina rank 


Roche/454 rank 


Bowtie 2 


4.00 


3.76 


BWA-MEM 


4.13 


4.00 


MOSAIK 


3.98 


3.98 


needle 


2.97 


3.70 


NextGenMap 


3.45 


3.80 


RAMICS 


2.43 


2.46 


SHRIMP 


3.71 


3.82 


SSAHA2 


3.37 


3.43 



For both test data sets, the highest ranked mapping approach is marked in bold. 
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Speed versus average quality (0 Score) for alignment of Roche/454 reads 



10'' 



t 



1 



NextGenMap j 



(bwamem) 

[ BWA + GATK ) 



(SSAHA2] 



I 10^ 



r-^ , (SHRiMP 1 

[needlej^ ' \ 



[ Bowtie ] 



RAMICS GPU 



( RAMICS CPu] 



[mosaik] 



0.980 0.985 

Average 0 Score for Roche/454 reads 



0.990 



Figure 6. Comparing mapping tools in terms of botti speed (reads mapped per second) and accuracy (average Q Score of tlie Roche/454 reads generated 
using the AFAB benchmarking tool). The error bars on each data point reflect the range for both the speed (y-axis) and accuracy (x-axis). RAMICS' speed 
performance remains competitive with some of the fastest mappers, while its quality performance outstrips them significantly. 



and, thus, much of the overhead necessary to hash a large 
genome is redundant in these analyses. RAMICS, on the 
other hand, is designed to perform a full dynamic program- 
ming alignment to a shorter reference sequence quickly and 
accurately. 

Comparing the RAMICS GPU version to the EMBOSS 
tool needle (45), which performs a full dynamic program- 
ming alignment, shows that RAMICS is roughly twice as 
fast (Figure 6). Further, the efficiency of RAMICS is evi- 
dent in that the codon-based approach implemented means 
it must perform eight comparisons to needle's three, and 
makes extensive use of floating point arithmetic to achieve 
the fine-grained distinctions required for its more com- 
plicated alignment model. Indeed, RAMICS non-coding, 
which performs a Needleman-Wunsch style alignment but 
takes into account homopolymer errors, is itself twice as fast 
as the full RAMICS algorithm (data not shown). 

There should, however, be no trade-off in quality for 
speed or vice versa and, thus, the optimal mapping ap- 
proach is one that can map the greatest number of reads 
with the highest accuracy in the shortest time. By evaluat- 
ing each mapping approach in terms of speed and accuracy 
together (Figure 6), we find that the GPU-based version of 
the full RAMICS codon-based algorithm outperforms all 



other approaches suggesting that, for amplicon-style anal- 
yses at least, many of the other approaches exhibit such a 
trade-off 

To our knowledge, RAMICS is the first mapping tool 
that undertakes accurate mapping of coding DNA to a 
reference sequence at speed, while concurrently identifying 
and correcting sequencing platform-induced errors. RAM- 
ICS' accuracy in generating biologically correct alignments 
of coding DNA means that it can be applied wherever the 
generation of high-quality alignments for sensitive detec- 
tion of SNPs and novel variants are required. 



AVAILABILITY 

The RAMICS software (University of the Western Cape, 
Copyright Reserved, 2013) is available under license from 
the University of the Western Cape, South Africa at http: 
//hiv.sanbi.ac.za/tools#/ramics. Terms of the license will in- 
clude but are not limited to the following: an annual up- 
front licensing fee for use only will be applicable for any use 
of the software for commercial purposes (to be defined), and 
a Tree use' only license will be available to verified and ap- 
proved academic institutions and public not-for-profit re- 
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search organizations for non-commercial use and/or appli- 
cation only. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online, includ- 
ing [1-19]. 
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